Skip to content

ENH: add masked algorithm for mean() function #34814

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Jan 1, 2021
Merged

Conversation

Akshatt
Copy link
Contributor

@Akshatt Akshatt commented Jun 16, 2020

@pep8speaks
Copy link

pep8speaks commented Jun 16, 2020

Hello @Akshatt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-12-28 13:40:51 UTC

@jorisvandenbossche
Copy link
Member

See here:

if name in {"sum", "prod", "min", "max"}:
op = getattr(masked_reductions, name)
return op(data, mask, skipna=skipna, **kwargs)

You can add "mean" there

@Akshatt Akshatt marked this pull request as ready for review June 16, 2020 08:48
@Akshatt
Copy link
Contributor Author

Akshatt commented Jun 16, 2020

@jorisvandenbossche,
The new mean function takes significantly less time
Screenshot (189)

Should tests be added? and if yes then where?
and same for whatsnew entry

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add tests (we likely aready have some setup for sum, you can just add onto those)

@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jun 16, 2020
@jreback
Copy link
Contributor

jreback commented Jun 16, 2020

can you add some asv's for the existing and this new case (perf metrics)

@Akshatt
Copy link
Contributor Author

Akshatt commented Jun 17, 2020

@jreback

pls add tests (we likely aready have some setup for sum, you can just add onto those)

Alright, so I need to add tests in

can you add some asv's for the existing and this new case (perf metrics)

Where exactly do I add the asv's in?

@jorisvandenbossche
Copy link
Member

This operation is already covered by

class NanOps:
params = [
[
"var",
"mean",
"median",
"max",
"min",
"sum",
"std",
"sem",
"argmax",
"skew",
"kurt",
"prod",
],
[10 ** 3, 10 ** 6],
["int8", "int32", "int64", "float64", "Int64", "boolean"],
]
param_names = ["func", "N", "dtype"]
def setup(self, func, N, dtype):
if func == "argmax" and dtype in {"Int64", "boolean"}:
# Skip argmax for nullable int since this doesn't work yet (GH-24382)
raise NotImplementedError
self.s = Series([1] * N, dtype=dtype)
self.func = getattr(self.s, func)
def time_func(self, func, N, dtype):
self.func()

(I added Int64 in general for all ops, so including mean, when I added a masked sum/prod algorithm)

So no need to write additional benchmarks I think.

For tests, it would be good to add it here:

def test_ops_consistency_on_empty(self, method):
(we mainly need to check the behaviour on empty)

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 17, 2020

We are also going to need to decide what to return for an empty array/Series. Right now we return np.nan, but this can also be pd.NA. This touches a bit the discussion we had about NaN vs NA for the floating dtype, but I suppose for now we probably want to go with pd.NA (that's what we do now).

@Akshatt
Copy link
Contributor Author

Akshatt commented Jun 17, 2020

@jorisvandenbossche,

We are also going to need to decide what to return for an empty array/Series. Right now we return np.nan, but this can also be pd.NA. This touches a bit the discussion we had about NaN vs NA for the floating dtype, but I suppose for now we probably want to go with pd.NA (that's what we do now).

This would involve

  • checking if the mean is np.nan and then
  • returning pd.NA instead in the mean function itself
    yes?

@dsaxton
Copy link
Member

dsaxton commented Sep 15, 2020

This would involve

  • checking if the mean is np.nan and then
  • returning pd.NA instead in the mean function itself
    yes?

@Akshatt You could also do a length check before any calculation as with other functions here, e.g. something like not values.size

@Akshatt
Copy link
Contributor Author

Akshatt commented Sep 15, 2020

This would involve

  • checking if the mean is np.nan and then
  • returning pd.NA instead in the mean function itself
    yes?

@Akshatt You could also do a length check before any calculation as with other functions here, e.g. something like not values.size

Hi @dsaxton, thanks for replying!

if not skipna: 
     if mask.any() or not values.size:
         return libmissing.NA

This would suffice, right?

@dsaxton
Copy link
Member

dsaxton commented Sep 15, 2020

@Akshatt Since we're always returning NA on empty input regardless of skipna could also

if not values.size:
    return pd.NA

# logic for non-empty input

@@ -688,6 +688,12 @@ def test_ops_consistency_on_empty(self, method):
result = getattr(Series(dtype=float), method)()
assert pd.isna(result)

# Empty Mean
if method == "mean":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is already being tested directly above? Probably want tests for non-empty input also (i.e., non-empty with some missing, non-empty with no missing, non-empty with all missing, parametrizing over skipna).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes redundant now. Previously np.nan was being returned for empty values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should i add those tests here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These cases would I hope already be tested for Series but maybe doesn't hurt to add some more tests under /pandas/tests/series/test_arithmetic.py (they could always be deduplicated later if already tested) since I believe masked arrays boxed inside Series hit this path. You could test means of both nullable integer and boolean dtypes.

As a follow-up enhancement would also be nice to use this to directly implement mean on masked arrays:

[ins] In [1]: import pandas as pd

[ins] In [2]: arr = pd.array([1, 2, 3])

[ins] In [3]: arr.sum()
Out[3]: 6

[ins] In [4]: arr.mean()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-057c7202e3e0> in <module>
----> 1 arr.mean()

AttributeError: 'IntegerArray' object has no attribute 'mean'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These cases would I hope already be tested for Series but maybe doesn't hurt to add some more tests under /pandas/tests/series/test_arithmetic.py (they could always be deduplicated later if already tested) since I believe masked arrays boxed inside Series hit this path. You could test means of both nullable integer and boolean dtypes.

I'm not sure exactly where to add the tests? In /pandas/tests/series/test_arithmetic.py should I add them in Unsorted section for now?
Also, I would be adding tests for:

  1. Nullable integer dtype
  2. Nullable boolean dtype
    yes?
    and how do I check if the tests are passing?

As a follow-up enhancement would also be nice to use this to directly implement mean on masked arrays:

[ins] In [1]: import pandas as pd

[ins] In [2]: arr = pd.array([1, 2, 3])

[ins] In [3]: arr.sum()
Out[3]: 6

[ins] In [4]: arr.mean()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-057c7202e3e0> in <module>
----> 1 arr.mean()

AttributeError: 'IntegerArray' object has no attribute 'mean'

yes that makes sense, although I'm not sure how to go about doing this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also probably good to merge master since it's been a while

Copy link
Contributor Author

@Akshatt Akshatt Oct 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsaxton
I have added these tests under

def test_ops_consistency_on_empty(self, method):

image

Any quick fixes? also, how do I ensure these tests are passing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dsaxton,
pushed to branch named tests. Please have a look

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Akshatt please push it to this branch (the same as this PR is based on, so which is your master branch), then we can see the tests here in the PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry @Akshatt if that was confusing. I meant in the future it's good to avoid modifying your master branch, not to work on a separate branch for this PR.

Regarding your test you'll also want to explicitly assert result is pd.NA which (oddly enough) is slightly different from pd.isna(result).

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 25, 2020
@dsaxton
Copy link
Member

dsaxton commented Oct 25, 2020

@Akshatt Is this still active? Can you address previous comments if so?

@jreback
Copy link
Contributor

jreback commented Dec 23, 2020

adding min_count ?

As I already answered, min_count is not applicable here. So I think it is only the whatsnew note that remains code-wise

this is fine (you are repeating an old commment which i agree is not relevant).

the asv's and whatsnew are still needed.

@jorisvandenbossche
Copy link
Member

you are repeating an old commment which i agree is not relevant

Jeff, I am answering @Akshatt because he asked exactly that 2 hours ago

@jorisvandenbossche
Copy link
Member

Since setting up ASV is not easiest thing to do, I quickly ran the aforementioned benchmark that covers this case (asv run --python=same -b "series_methods\.NanOps\.time_func\('mean'" on master vs this branch):

master:

[100.00%] ··· series_methods.NanOps.time_func              ok
[100.00%] ··· ======== ========= ============ ============= ============= ============= ============ ============
              --                                                      dtype                                      
              ------------------ --------------------------------------------------------------------------------
                func       N         int8         int32         int64        float64       Int64       boolean   
              ======== ========= ============ ============= ============= ============= ============ ============
                mean      1000    36.3±0.3μs    21.2±0.1μs    21.0±0.1μs   21.0±0.06μs   40.7±0.2μs   46.1±0.2μs 
                mean    1000000    578±3μs     1.06±0.04ms   1.14±0.04ms    1.26±0.2ms   1.87±0.2ms   1.63±0.2ms 
              ======== ========= ============ ============= ============= ============= ============ ============

this branch:

[100.00%] ··· ======== ========= =========== ============ ============ ============ ============= =============
              --                                                     dtype                                     
              ------------------ ------------------------------------------------------------------------------
                func       N         int8       int32        int64       float64        Int64        boolean   
              ======== ========= =========== ============ ============ ============ ============= =============
                mean      1000     47.0±5μs   23.2±0.9μs    25.1±3μs    22.5±0.7μs     15.6±1μs      19.3±1μs  
                mean    1000000   779±200μs   1.23±0.3ms   1.36±0.3ms   1.46±0.4ms   1.45±0.08ms   1.38±0.03ms 
              ======== ========= =========== ============ ============ ============ ============= =============

So you see a modest speedup for the Int64 / boolean case (although apparently for larger data, the relative benefit diminishes)

@jreback
Copy link
Contributor

jreback commented Dec 23, 2020

you are repeating an old commment which i agree is not relevant

Jeff, I am answering @Akshatt because he asked exactly that 2 hours ago

as did i but ok

@Akshatt
Copy link
Contributor Author

Akshatt commented Dec 24, 2020

Hi, sorry to have caused all the misunderstanding.
@jorisvandenbossche, thanks for running the asv benchmark as I have no experience in doing it.
So, all that's left is to add the entry in the Whats new in v1.3.0.rst under Performance improvements like so
" Performance improvement in :meth:array.mean (:issue:34814) ". is this okay? Should I attach the asv results as well?

@jorisvandenbossche
Copy link
Member

So, all that's left is to add the entry in the Whats new in v1.3.0.rst under Performance improvements like so
" Performance improvement in :meth:array.mean (:issue:34814) ".

Yes, that's all.

@jreback jreback merged commit 2a60c56 into pandas-dev:master Jan 1, 2021
@jreback
Copy link
Contributor

jreback commented Jan 1, 2021

thanks @Akshatt

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: add masked algorithm for mean()
7 participants